Problem Statement
Imports/packages
EDA (Exploratory Data Analysis)
Data Preprocessing
Model Creation/Testing
The task of the competition is to predict the temperature at the next hour given the weather conditions and temperatures for the prior hours. This is a multivariate time-series forecasting problem, and we tackle it with a neural network approach.
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
import plotly as py
import plotly.graph_objs as go
%matplotlib inline
X_train = pd.read_csv("climate_hour_train.csv")
X_train.head() # peek at the data
Below we have some basic statistics:
X_train.describe()
Let's check whether there are any null values in the data with a heatmap. Any missing entries (NaN or null values) will show up in the heatmap as yellow:
sns.heatmap(X_train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
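The heatmap gives a quick visual answer; for an exact per-column count, `isnull().sum()` works as well. A minimal sketch on a small synthetic frame (the notebook's CSV isn't reproduced here):

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for X_train
df = pd.DataFrame({"T (degC)": [10.2, np.nan, 11.5],
                   "p (mbar)": [989.1, 990.3, 991.0]})

null_counts = df.isnull().sum()   # missing values per column
print(null_counts["T (degC)"])    # 1 - one NaN in the temperature column
print(df.isnull().values.any())   # True if anything at all is missing
```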
Below we have the distributions of some of the variables. On the top left is temperature, our target variable, in Celsius. We can see that its distribution is slightly bimodal but otherwise fairly normal. The other variables shown would carry more meaning for a meteorologist, but visually we can see that their distributions are fairly similar to each other.
sns.set(style="white", palette="muted", color_codes=True)
# Set up the matplotlib figure
f, axes = plt.subplots(2, 2, figsize=(7,7))
sns.despine(left=True)
# Plot a simple histogram with binsize determined automatically
sns.distplot(X_train['T (degC)'], kde=False, color="b", ax=axes[0, 0])
# Plot a kernel density estimate and rug plot
sns.distplot(X_train['H2OC (mmol/mol)'], hist=False, rug=True, color="r", ax=axes[0, 1])
# Plot a filled kernel density estimate
sns.distplot(X_train['VPact (mbar)'], hist=False, color="g", kde_kws={"shade": True}, ax=axes[1, 0])
# Plot a histogram and kernel density estimate
sns.distplot(X_train['VPmax (mbar)'], color="m", ax=axes[1, 1])
plt.setp(axes, yticks=[])
plt.tight_layout()
In the pairplot below, we can see the pairwise relationships between all the variables. If we look at temperature specifically (the second row), most predictors are fairly correlated with it, with the exception of a few weaker ones.
sns.pairplot(X_train.drop("Date Time",axis=1))
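A full pairplot is expensive for 14 variables; `DataFrame.corr()` gives the same pairwise (Pearson) correlations numerically. A small sketch on synthetic data with a built-in linear relationship (column names chosen to mirror the notebook's):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = rng.normal(10, 8, 500)  # synthetic "temperature"
df = pd.DataFrame({"T (degC)": t,
                   "VPmax (mbar)": 0.9 * t + rng.normal(0, 2, 500),  # correlated
                   "noise": rng.normal(size=500)})                    # uncorrelated

corr = df.corr()  # pairwise Pearson correlation matrix
print(corr.loc["T (degC)", "VPmax (mbar)"])  # strongly positive, close to 1
```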
Alternatively, for a more intuitive view over time (this being a time-series problem), we can plot the temperature change (in Celsius) from the first date to the last of the training set. Since the data is already ordered by date, the index serves as the x-axis.
plt.figure(figsize=(22,6))
sns.lineplot(x=X_train.index, y=X_train['T (degC)'])
plt.title('Temperature Change Over Time')
plt.show()
Let's zoom in to the first 24 hours, since that's what we're going to be timestepping by:
plt.plot(range(24), X_train["T (degC)"][:24])
We know that the dataset starts at midnight, which accounts for the lower initial temperatures (the sun isn't up yet), and we can see the temperature gradually rise through the day.
from scipy import stats
pd.set_option('display.max_columns', None)
import tensorflow
X_train = pd.read_csv("climate_hour_train.csv")
X_test1 = pd.read_csv("climate_Xtest.csv",header=None)
X_test2 = pd.read_csv("climate_Xtest2.csv",header=None)
X_test3 = pd.read_csv("climate_Xtest3.csv",header=None)
# Had to split up test file into 3 separate files and concatenate them because
# the original file was too big for IBM cognitive class
numeric_cols = X_train.select_dtypes(include=[np.number]).columns
X_train_Norm = X_train[numeric_cols].apply(stats.zscore) #Z score normalization train
X_test = pd.concat([X_test1, X_test2, X_test3], axis=1)  # reassemble the split test file
X_test = X_test.apply(stats.zscore) #Z score normalization test
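One caveat with applying `stats.zscore` directly is that the mean and standard deviation are discarded, yet they are needed later to map predictions back to degrees Celsius. A minimal sketch of saving them explicitly for the inverse transform (toy values; `stats.zscore` uses the population standard deviation, `ddof=0`):

```python
import numpy as np
import pandas as pd
from scipy import stats

col = pd.Series([5.0, 10.0, 15.0, 20.0], name="T (degC)")  # toy temperatures

mu, sigma = col.mean(), col.std(ddof=0)  # ddof=0 matches stats.zscore's default
z = stats.zscore(col)                    # normalized values

restored = z * sigma + mu                # inverse transform back to degC
print(np.allclose(restored, col))        # True
```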
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = pd.DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j + 1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j + 1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j + 1, i)) for j in range(n_vars)]
    # put it all together
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg
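The shifting mechanics behind this framing can be sanity-checked on a tiny two-column frame. A standalone sketch of the same idea (not the function above), showing how `shift` lines up past values against current ones:

```python
import pandas as pd

df = pd.DataFrame({"var1": [1, 2, 3, 4], "var2": [10, 20, 30, 40]})

# Pair each row with the previous row, dropping the first (which has no past)
lagged = pd.concat([df.shift(1).add_suffix("(t-1)"),
                    df.add_suffix("(t)")], axis=1).dropna()

print(lagged["var1(t-1)"].tolist())  # [1.0, 2.0, 3.0]
print(lagged["var2(t)"].tolist())    # [20, 30, 40]
```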
x_train = series_to_supervised(X_train_Norm, 0, 24, True)  # reframe into 24-step windows
x_train.drop(52542, axis=0, inplace=True)  # drop one trailing row so the reshape works out
y_train = np.asarray(x_train['var2(t+23)'])  # target: temperature (var2) at the final timestep
x_train = np.asarray(x_train)
x_train = x_train.reshape((52542,24,14)) #keras ready
y_train = y_train.reshape((y_train.shape[0], 1))
y_train.shape
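The flat-window-to-3-D reshape above relies on NumPy's row-major ordering: each run of 14 consecutive columns becomes one timestep. A scaled-down sketch of the same `(samples, timesteps, features)` reshape (2 windows instead of 52542):

```python
import numpy as np

flat = np.arange(2 * 24 * 14).reshape(2, 24 * 14)  # 2 windows, each flattened to 336 values
seq = flat.reshape(2, 24, 14)                      # (samples, timesteps, features)

print(seq.shape)      # (2, 24, 14)
print(seq[0, 1, 0])   # 14 - timestep 1 starts right after the 14 features of timestep 0
```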
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import LSTM
data_dim = 14
timesteps = 24
model = Sequential()
model.add(LSTM(80, input_shape=(timesteps, data_dim)))  # 80 units seems to be the sweet spot
model.add(Dropout(0.06))  # dropout rate - anything larger than 0.1 gave worse results
model.add(Dense(1, activation='linear'))  # linear activation shows the best results
#simplest seems to be best
model.compile(loss='mae',  # MAE loss; tried different optimizers (SGD, Adadelta, etc.)
              optimizer='adam')
# note: classification 'accuracy' is not meaningful for a regression target, so no extra metrics
model.summary()
model.fit(x_train, y_train, batch_size=80, epochs=25, shuffle=False)  # shuffle off to keep time order; tried a few epoch counts, and a batch size of 80 seems best
predictions = pd.DataFrame(
    model.predict(np.asarray(X_test).reshape((17447, 24, 14)), batch_size=80)
).apply(lambda x: (x * 8.01432) + 10.25075)
# final predictions, converted back from z-scores to degrees Celsius
# (8.01432 and 10.25075 are the training set's standard deviation and mean of T (degC))
predictions.to_csv("predictions.csv")
loss = [0.0216,0.0221,0.0219,0.0214,0.0220,0.0211,0.0207,0.0220,0.0213,0.0211,0.0208,0.0213,0.0209,0.0212,
0.0210,0.0209,0.0211,0.0211,0.0203,0.0217,0.0207,0.0206,0.0218,0.0204,0.0216]
sns.lineplot(x=range(1, 26), y=loss)  # training loss (MAE) per epoch
loss2 = [0.0960,0.0326,0.0248,0.0220,0.0222,0.0212,0.0205,0.0207,0.0202,0.0210,0.0194,0.0201,0.0200,0.0200
,0.0198,0.0194,0.0197,0.0195,0.0192,0.0198,0.0191,0.0189,0.0199,0.0192,0.0188]
plt.ylim(0.018,0.024)
sns.lineplot(x=range(1, 26), y=loss2)